Stack Overflow
StackEval: Benchmarking LLMs in Coding Assistance
LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.
AI coding is now everywhere. But not everyone is convinced.
Developers are navigating confusing gaps between expectation and reality. So are the rest of us. Depending on who you ask, AI-powered coding is either giving software developers an unprecedented productivity boost or churning out masses of poorly designed code that saps their attention and sets software projects up for serious long-term maintenance problems. The problem is that, right now, it's not easy to know which is true. As tech giants pour billions into large language models (LLMs), coding has been touted as the technology's killer app. Both Microsoft CEO Satya Nadella and Google CEO Sundar Pichai have claimed that around a quarter of their companies' code is now AI-generated. And in March, Anthropic's CEO, Dario Amodei, predicted that within six months 90% of all code would be written by AI.
Automating API Documentation with LLMs: A BERTopic Approach
Developers rely on API documentation, but official sources are often lengthy, complex, or incomplete. Many turn to community-driven forums like Stack Overflow for practical insights. We propose automating the summarization of informal sources, focusing on Android APIs. Using BERTopic, we extracted prevalent topics from 3.6 million Stack Overflow posts and applied extractive summarization techniques to generate concise summaries, including code snippets. A user study with 30 Android developers assessed the summaries for coherence, relevance, informativeness, and satisfaction, showing improved productivity. Integrating formal API knowledge with community-generated content enhances documentation, making API resources more accessible and actionable.
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
Thakur, Nandan, Lin, Jimmy, Havens, Sam, Carbin, Michael, Khattab, Omar, Drozdov, Andrew
We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.
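The FreshStack abstract mentions retrieving documents with "a fusion of retrieval techniques" but does not say which fusion method is used. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common way to merge ranked lists; the function name, the constant k=60, and the toy document IDs are all assumptions for this example and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Hypothetical helper: each document receives 1 / (k + rank) from every
    list in which it appears, and documents are sorted by the summed score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy runs standing in for two different retrievers (made-up IDs).
lexical_run = ["d3", "d1", "d2"]
dense_run = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([lexical_run, dense_run])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why rank-based fusion is often chosen when the retrievers' raw scores are not comparable.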